Brazilian E-Commerce Public Database by Olist https://www.kaggle.com/olistbr/brazilian-ecommerce?select=olist_order_items_dataset.csv
The dataset was provided by Olist, a marketplace company based in Brazil. Olist connects small businesses across the whole country in a simple way: sellers can sell their products directly through the Olist Store, and deliveries are handled by logistics partners.
What are the main factors that drive the number of sales? Which factors impact the customer review score?
This is the second notebook, in which we perform clustering and analyze the resulting clusters to generate insights. In the first notebook (prefix '01'), we carried out the full exploratory visualization of the dataset, deriving insights that will be used throughout the next steps.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import re
import plotly.express as px
import plotly.graph_objects as go
import dash
import scipy.stats as stats
from warnings import simplefilter
from functools import reduce
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from math import ceil
from sklearn.cluster import KMeans, DBSCAN, AffinityPropagation
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import cdist, pdist
from sklearn.metrics import silhouette_score
from matplotlib.axes._axes import _log as matplotlib_axes_logger
from IPython.display import HTML, display
# Versions of the packages used in this Jupyter notebook
%reload_ext watermark
%watermark -a "Herikc Brecher" --iversions
Author: Herikc Brecher

pandas    : 1.4.2
seaborn   : 0.11.2
re        : 2.2.1
matplotlib: 3.5.1
plotly    : 5.6.0
dash      : 2.6.2
numpy     : 1.20.0
scipy     : 1.5.4
simplefilter(action='ignore', category=Warning)
matplotlib_axes_logger.setLevel('ERROR')
%matplotlib inline
sns.set_theme()
# Random seed
seed_ = 194
np.random.seed(seed_)
display(HTML("<style>.container { width:100% !important; }</style>"))
# Loading all datasets
dtCustomers = pd.read_csv('../data/olist_customers_dataset.csv', encoding = 'utf8', dtype={'customer_zip_code_prefix': str})
dtGeolocation = pd.read_csv('../data/olist_geolocation_dataset.csv', encoding = 'utf8', dtype={'geolocation_zip_code_prefix': str})
dtOrderItems = pd.read_csv('../data/olist_order_items_dataset.csv', encoding = 'utf8')
dtOrderPayments = pd.read_csv('../data/olist_order_payments_dataset.csv', encoding = 'utf8')
dtOrderReviews = pd.read_csv('../data/olist_order_reviews_dataset.csv', encoding = 'utf8')
dtOrders = pd.read_csv('../data/olist_orders_dataset.csv', encoding = 'utf8')
dtProducts = pd.read_csv('../data/olist_products_dataset.csv', encoding = 'utf8')
dtSellers = pd.read_csv('../data/olist_sellers_dataset.csv', encoding = 'utf8')
We will add an extra variable to every dataset that contains the state: the region (`regiao`) corresponding to that state.
regioes = {
'AC': 'Norte',
'AL': 'Nordeste',
'AP': 'Norte',
'AM': 'Norte',
'BA': 'Nordeste',
'CE': 'Nordeste',
'DF': 'CentroOeste',
'ES': 'Sudeste',
'GO': 'CentroOeste',
'MA': 'Nordeste',
'MT': 'CentroOeste',
'MS': 'CentroOeste',
'MG': 'Sudeste',
'PA': 'Norte',
'PB': 'Nordeste',
'PR': 'Sul',
'PE': 'Nordeste',
'PI': 'Nordeste',
'RJ': 'Sudeste',
'RN': 'Nordeste',
'RS': 'Sul',
'RO': 'Norte',
'RR': 'Norte',
'SC': 'Sul',
'SP': 'Sudeste',
'SE': 'Nordeste',
'TO': 'Norte'
}
dtCustomers['regiao'] = dtCustomers['customer_state'].copy()
dtCustomers = dtCustomers.replace({'regiao': regioes})
dtCustomers.head()
| | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
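Since `DataFrame.replace` silently leaves values that are missing from the dictionary untouched, a quick sanity check (a minimal sketch, not part of the original pipeline, using a subset of the mapping and a made-up invalid code `'XX'`) can use `Series.map`, which instead returns `NaN` for unmapped state codes:

```python
import pandas as pd

# Subset of the state-to-region mapping, for illustration only
regioes = {'SP': 'Sudeste', 'PR': 'Sul', 'BA': 'Nordeste'}

states = pd.Series(['SP', 'PR', 'BA', 'XX'])  # 'XX' is a hypothetical invalid state code
mapped = states.map(regioes)

# Unmapped codes become NaN, making gaps in the dictionary easy to detect
print(mapped.isna().sum())  # 1
```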
dtSellers['regiao'] = dtSellers['seller_state'].copy()
dtSellers = dtSellers.replace({'regiao': regioes})
dtSellers.head()
| | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
dtGeolocation['regiao'] = dtGeolocation['geolocation_state'].copy()
dtGeolocation = dtGeolocation.replace({'regiao': regioes})
dtGeolocation.head()
| | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | regiao |
|---|---|---|---|---|---|---|
| 0 | 01037 | -23.545621 | -46.639292 | sao paulo | SP | Sudeste |
| 1 | 01046 | -23.546081 | -46.644820 | sao paulo | SP | Sudeste |
| 2 | 01046 | -23.546129 | -46.642951 | sao paulo | SP | Sudeste |
| 3 | 01041 | -23.544392 | -46.639499 | sao paulo | SP | Sudeste |
| 4 | 01035 | -23.541578 | -46.641607 | sao paulo | SP | Sudeste |
Because the problem at hand is split across 8 different datasets, we will build more unified datasets in order to simplify data preparation and clustering.
dtCustomers.head()
| | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
dtGeolocation.head()
| | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | regiao |
|---|---|---|---|---|---|---|
| 0 | 01037 | -23.545621 | -46.639292 | sao paulo | SP | Sudeste |
| 1 | 01046 | -23.546081 | -46.644820 | sao paulo | SP | Sudeste |
| 2 | 01046 | -23.546129 | -46.642951 | sao paulo | SP | Sudeste |
| 3 | 01041 | -23.544392 | -46.639499 | sao paulo | SP | Sudeste |
| 4 | 01035 | -23.541578 | -46.641607 | sao paulo | SP | Sudeste |
dtOrderItems.head()
| | order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value |
|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.90 | 13.29 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.90 | 19.93 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.00 | 17.87 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.90 | 18.14 |
dtOrderPayments.head()
| | order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|---|---|---|---|---|
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
| 3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
| 4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
dtOrderReviews.head()
| | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | NaN | NaN | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | NaN | NaN | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | NaN | NaN | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | NaN | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | NaN | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |
dtOrders.head()
| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
dtProducts.head()
| | product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 |
| 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 |
| 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 |
| 3 | cef67bcfe19066a932b7673e239eb23d | bebes | 27.0 | 261.0 | 1.0 | 371.0 | 26.0 | 4.0 | 26.0 |
| 4 | 9dc1a7de274444849c219cff195d0b71 | utilidades_domesticas | 37.0 | 402.0 | 4.0 | 625.0 | 20.0 | 17.0 | 13.0 |
dtSellers.head()
| | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | regiao |
|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | Sudeste |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 09790 | sao bernardo do campo | SP | Sudeste |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 01151 | sao paulo | SP | Sudeste |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 08775 | mogi das cruzes | SP | Sudeste |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | Sudeste |
# The max order_item_id per order corresponds to the number of items in that order
listDtMergeOrders = [dtOrderItems[['order_id', 'freight_value']], dtOrderPayments, dtOrderReviews, dtOrders,\
                     dtOrderItems.groupby('order_id').agg({'order_item_id': 'max'}).reset_index()]
dtGeneralOrders = reduce(lambda left, right: pd.merge(left, right, on = 'order_id'), listDtMergeOrders)
dtGeneralOrders = pd.merge(dtGeneralOrders, dtCustomers[['customer_id', 'customer_unique_id', 'customer_city', 'customer_state', 'regiao']],\
on = 'customer_id')
dtGeneralOrders.head()
| | order_id | freight_value | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | review_comment_title | review_comment_message | ... | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | order_item_id | customer_unique_id | customer_city | customer_state | regiao |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | 1 | credit_card | 2 | 72.19 | 97ca439bc427b48bc1cd7177abe71365 | 5 | NaN | Perfeito, produto entregue antes do combinado. | ... | 2017-09-13 08:59:02 | 2017-09-13 09:45:35 | 2017-09-19 18:34:16 | 2017-09-20 23:43:48 | 2017-09-29 00:00:00 | 1 | 871766c5855e863f6eccc05f988b23cb | campos dos goytacazes | RJ | Sudeste |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | 1 | credit_card | 3 | 259.83 | 7b07bacd811c4117b742569b04ce3580 | 4 | NaN | NaN | ... | 2017-04-26 10:53:06 | 2017-04-26 11:05:13 | 2017-05-04 14:35:00 | 2017-05-12 16:04:24 | 2017-05-15 00:00:00 | 1 | eb28e67c4c0b83846050ddfb8a35d051 | santa fe do sul | SP | Sudeste |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | 1 | credit_card | 5 | 216.87 | 0c5b33dea94867d1ac402749e5438e8b | 5 | NaN | Chegou antes do prazo previsto e o produto sur... | ... | 2018-01-14 14:33:31 | 2018-01-14 14:48:30 | 2018-01-16 12:36:48 | 2018-01-22 13:19:16 | 2018-02-05 00:00:00 | 1 | 3818d81c6709e39d06b2738a8d3a2474 | para de minas | MG | Sudeste |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | 1 | credit_card | 2 | 25.78 | f4028d019cb58564807486a6aaf33817 | 4 | NaN | NaN | ... | 2018-08-08 10:00:35 | 2018-08-08 10:10:18 | 2018-08-10 13:28:00 | 2018-08-14 13:32:39 | 2018-08-20 00:00:00 | 1 | af861d436cfc08b2c2ddefd0ba074622 | atibaia | SP | Sudeste |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | 1 | credit_card | 3 | 218.04 | 940144190dcba6351888cafa43f3a3a5 | 5 | NaN | Gostei pois veio no prazo determinado . | ... | 2017-02-04 13:57:51 | 2017-02-04 14:10:13 | 2017-02-16 09:46:09 | 2017-03-01 16:42:31 | 2017-03-17 00:00:00 | 1 | 64b576fb70d441e8f1b2d7d446e483c5 | varzea paulista | SP | Sudeste |
5 rows × 24 columns
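The `reduce` call above folds `pd.merge` over the list of dataframes, producing successive inner joins on `order_id`. A toy illustration (with made-up dataframes `a`, `b`, `c`) shows how rows lacking the key in any dataframe drop out:

```python
from functools import reduce
import pandas as pd

a = pd.DataFrame({'order_id': [1, 2], 'x': ['a', 'b']})
b = pd.DataFrame({'order_id': [1, 2], 'y': [10, 20]})
c = pd.DataFrame({'order_id': [2], 'z': [0.5]})

# Successive inner joins: only keys present in every dataframe survive
merged = reduce(lambda left, right: pd.merge(left, right, on='order_id'), [a, b, c])
print(merged.shape)  # (1, 4) -- only order_id 2 appears in all three
```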
# Converting the date columns from string to datetime
dtGeneralOrders['order_purchase_timestamp'] = \
    pd.to_datetime(dtGeneralOrders['order_purchase_timestamp'])
dtGeneralOrders['order_delivered_customer_date'] = \
    pd.to_datetime(dtGeneralOrders['order_delivered_customer_date'])
dtGeneralOrders['order_estimated_delivery_date'] = \
    pd.to_datetime(dtGeneralOrders['order_estimated_delivery_date'])
dtGeneralOrders['diff_delivery_and_estimate'] = dtGeneralOrders['order_estimated_delivery_date'] -\
dtGeneralOrders['order_delivered_customer_date']
dtGeneralOrders['diff_delivery_and_purchase'] = dtGeneralOrders['order_purchase_timestamp'] -\
dtGeneralOrders['order_delivered_customer_date']
# Strip everything except digits from the timedelta strings
# (note: this concatenates days, hours, minutes and seconds into one number and drops the sign)
dtGeneralOrders['diff_delivery_and_estimate'] = \
    dtGeneralOrders['diff_delivery_and_estimate'].apply(lambda x: re.sub("[^0-9]", "", str(x)))
dtGeneralOrders['diff_delivery_and_purchase'] = \
    dtGeneralOrders['diff_delivery_and_purchase'].apply(lambda x: re.sub("[^0-9]", "", str(x)))
dtGeneralOrders['diff_delivery_and_estimate'] = pd.to_numeric(dtGeneralOrders['diff_delivery_and_estimate'])
dtGeneralOrders['diff_delivery_and_purchase'] = pd.to_numeric(dtGeneralOrders['diff_delivery_and_purchase'])
dtGeneralOrders.head()
| | order_id | freight_value | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | review_comment_title | review_comment_message | ... | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | order_item_id | customer_unique_id | customer_city | customer_state | regiao | diff_delivery_and_estimate | diff_delivery_and_purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 13.29 | 1 | credit_card | 2 | 72.19 | 97ca439bc427b48bc1cd7177abe71365 | 5 | NaN | Perfeito, produto entregue antes do combinado. | ... | 2017-09-19 18:34:16 | 2017-09-20 23:43:48 | 2017-09-29 | 1 | 871766c5855e863f6eccc05f988b23cb | campos dos goytacazes | RJ | Sudeste | 8001612.0 | 8091514.0 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 19.93 | 1 | credit_card | 3 | 259.83 | 7b07bacd811c4117b742569b04ce3580 | 4 | NaN | NaN | ... | 2017-05-04 14:35:00 | 2017-05-12 16:04:24 | 2017-05-15 | 1 | eb28e67c4c0b83846050ddfb8a35d051 | santa fe do sul | SP | Sudeste | 2075536.0 | 17184842.0 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 17.87 | 1 | credit_card | 5 | 216.87 | 0c5b33dea94867d1ac402749e5438e8b | 5 | NaN | Chegou antes do prazo previsto e o produto sur... | ... | 2018-01-16 12:36:48 | 2018-01-22 13:19:16 | 2018-02-05 | 1 | 3818d81c6709e39d06b2738a8d3a2474 | para de minas | MG | Sudeste | 13104044.0 | 8011415.0 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 12.79 | 1 | credit_card | 2 | 25.78 | f4028d019cb58564807486a6aaf33817 | 4 | NaN | NaN | ... | 2018-08-10 13:28:00 | 2018-08-14 13:32:39 | 2018-08-20 | 1 | af861d436cfc08b2c2ddefd0ba074622 | atibaia | SP | Sudeste | 5102721.0 | 7202756.0 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 18.14 | 1 | credit_card | 3 | 218.04 | 940144190dcba6351888cafa43f3a3a5 | 5 | NaN | Gostei pois veio no prazo determinado . | ... | 2017-02-16 09:46:09 | 2017-03-01 16:42:31 | 2017-03-17 | 1 | 64b576fb70d441e8f1b2d7d446e483c5 | varzea paulista | SP | Sudeste | 15071729.0 | 26211520.0 |
5 rows × 26 columns
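Stripping the non-digit characters from a stringified `Timedelta` concatenates days, hours, minutes and seconds into a single number (e.g. `8 days 00:16:12` becomes `8001612`) and discards the sign. If a cleaner feature were desired, one alternative sketch (not the approach used in this notebook, shown here with made-up dates) is to take the difference in whole days via the `.dt.days` accessor:

```python
import pandas as pd

delivered = pd.to_datetime(pd.Series(['2017-09-20 23:43:48', '2017-05-15 10:00:00']))
estimated = pd.to_datetime(pd.Series(['2017-09-29 00:00:00', '2017-05-12 00:00:00']))

# Whole days between the estimate and the actual delivery; sign is preserved,
# so negative values indicate a late delivery
diff_days = (estimated - delivered).dt.days
print(diff_days.tolist())  # [8, -4]
```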
dtGeneralOrders = dtGeneralOrders[['customer_unique_id', 'order_id', 'order_purchase_timestamp', 'freight_value', 'payment_type', 'payment_installments', 'payment_value',\
'review_score', 'order_status', 'order_item_id',\
'diff_delivery_and_estimate', 'diff_delivery_and_purchase',\
'customer_state', 'regiao']]
dtGeneralOrders.head()
| | customer_unique_id | order_id | order_purchase_timestamp | freight_value | payment_type | payment_installments | payment_value | review_score | order_status | order_item_id | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | regiao |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 871766c5855e863f6eccc05f988b23cb | 00010242fe8c5a6d1ba2dd792cb16214 | 2017-09-13 08:59:02 | 13.29 | credit_card | 2 | 72.19 | 5 | delivered | 1 | 8001612.0 | 8091514.0 | RJ | Sudeste |
| 1 | eb28e67c4c0b83846050ddfb8a35d051 | 00018f77f2f0320c557190d7a144bdd3 | 2017-04-26 10:53:06 | 19.93 | credit_card | 3 | 259.83 | 4 | delivered | 1 | 2075536.0 | 17184842.0 | SP | Sudeste |
| 2 | 3818d81c6709e39d06b2738a8d3a2474 | 000229ec398224ef6ca0657da4fc703e | 2018-01-14 14:33:31 | 17.87 | credit_card | 5 | 216.87 | 5 | delivered | 1 | 13104044.0 | 8011415.0 | MG | Sudeste |
| 3 | af861d436cfc08b2c2ddefd0ba074622 | 00024acbcdf0a6daa1e931b038114c75 | 2018-08-08 10:00:35 | 12.79 | credit_card | 2 | 25.78 | 4 | delivered | 1 | 5102721.0 | 7202756.0 | SP | Sudeste |
| 4 | 64b576fb70d441e8f1b2d7d446e483c5 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 2017-02-04 13:57:51 | 18.14 | credit_card | 3 | 218.04 | 5 | delivered | 1 | 15071729.0 | 26211520.0 | SP | Sudeste |
dtGeneralOrders = dtGeneralOrders[dtGeneralOrders['order_status'] == 'delivered']
print(dtGeneralOrders.isna().sum())
customer_unique_id            0
order_id                      0
order_purchase_timestamp      0
freight_value                 0
payment_type                  0
payment_installments          0
payment_value                 0
review_score                  0
order_status                  0
order_item_id                 0
diff_delivery_and_estimate    8
diff_delivery_and_purchase    8
customer_state                0
regiao                        0
dtype: int64
# Dropping NA values
dtGeneralOrders = dtGeneralOrders.dropna()
dtGeneralOrders = dtGeneralOrders.drop('order_status', axis = 1)
dtGeneralOrders.shape
(115720, 13)
dtGeneralOrders.head()
| | customer_unique_id | order_id | order_purchase_timestamp | freight_value | payment_type | payment_installments | payment_value | review_score | order_item_id | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | regiao |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 871766c5855e863f6eccc05f988b23cb | 00010242fe8c5a6d1ba2dd792cb16214 | 2017-09-13 08:59:02 | 13.29 | credit_card | 2 | 72.19 | 5 | 1 | 8001612.0 | 8091514.0 | RJ | Sudeste |
| 1 | eb28e67c4c0b83846050ddfb8a35d051 | 00018f77f2f0320c557190d7a144bdd3 | 2017-04-26 10:53:06 | 19.93 | credit_card | 3 | 259.83 | 4 | 1 | 2075536.0 | 17184842.0 | SP | Sudeste |
| 2 | 3818d81c6709e39d06b2738a8d3a2474 | 000229ec398224ef6ca0657da4fc703e | 2018-01-14 14:33:31 | 17.87 | credit_card | 5 | 216.87 | 5 | 1 | 13104044.0 | 8011415.0 | MG | Sudeste |
| 3 | af861d436cfc08b2c2ddefd0ba074622 | 00024acbcdf0a6daa1e931b038114c75 | 2018-08-08 10:00:35 | 12.79 | credit_card | 2 | 25.78 | 4 | 1 | 5102721.0 | 7202756.0 | SP | Sudeste |
| 4 | 64b576fb70d441e8f1b2d7d446e483c5 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 2017-02-04 13:57:51 | 18.14 | credit_card | 3 | 218.04 | 5 | 1 | 15071729.0 | 26211520.0 | SP | Sudeste |
dtGeneralOrders = dtGeneralOrders.rename(columns = {'order_item_id': 'quantity', 'regiao': 'customer_region'})
dtGeneralOrders.head()
| | customer_unique_id | order_id | order_purchase_timestamp | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | customer_state | customer_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 871766c5855e863f6eccc05f988b23cb | 00010242fe8c5a6d1ba2dd792cb16214 | 2017-09-13 08:59:02 | 13.29 | credit_card | 2 | 72.19 | 5 | 1 | 8001612.0 | 8091514.0 | RJ | Sudeste |
| 1 | eb28e67c4c0b83846050ddfb8a35d051 | 00018f77f2f0320c557190d7a144bdd3 | 2017-04-26 10:53:06 | 19.93 | credit_card | 3 | 259.83 | 4 | 1 | 2075536.0 | 17184842.0 | SP | Sudeste |
| 2 | 3818d81c6709e39d06b2738a8d3a2474 | 000229ec398224ef6ca0657da4fc703e | 2018-01-14 14:33:31 | 17.87 | credit_card | 5 | 216.87 | 5 | 1 | 13104044.0 | 8011415.0 | MG | Sudeste |
| 3 | af861d436cfc08b2c2ddefd0ba074622 | 00024acbcdf0a6daa1e931b038114c75 | 2018-08-08 10:00:35 | 12.79 | credit_card | 2 | 25.78 | 4 | 1 | 5102721.0 | 7202756.0 | SP | Sudeste |
| 4 | 64b576fb70d441e8f1b2d7d446e483c5 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 2017-02-04 13:57:51 | 18.14 | credit_card | 3 | 218.04 | 5 | 1 | 15071729.0 | 26211520.0 | SP | Sudeste |
colunas_quantitativas = ['freight_value', 'payment_value', 'quantity',\
'diff_delivery_and_estimate', 'diff_delivery_and_purchase']
def outlier_detect(x, alpha = 0.05, nome = 'X'):
    n = len(x)
    mean_x = np.mean(x)
    std_x = np.std(x)
    numerator = max(abs(x - mean_x))
    # Grubbs test statistic
    grubbs_calculated = numerator / std_x
    print('Grubbs Test:', grubbs_calculated)
    # Two-sided critical value from the t distribution
    t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    # Grubbs critical value formula
    grubbs_critical = ((n - 1) * np.sqrt(np.square(t_value))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value)))
    print('Grubbs Critical Value:', grubbs_critical)
    if grubbs_critical > grubbs_calculated:
        print('For variable', nome, 'the computed value is LOWER than the critical value. We accept the hypothesis that there are NO outliers.')
    else:
        print('For variable', nome, 'the computed value is GREATER than the critical value. We accept the hypothesis that there ARE outliers.')
    print('\n')
for col in colunas_quantitativas:
    outlier_detect(dtGeneralOrders[col], nome = col)
Grubbs Test: 24.79253432170545
Grubbs Critical Value: 5.053985479091401
For variable freight_value the computed value is GREATER than the critical value. We accept the hypothesis that there ARE outliers.

Grubbs Test: 50.79085515326686
Grubbs Critical Value: 5.053985479091401
For variable payment_value the computed value is GREATER than the critical value. We accept the hypothesis that there ARE outliers.

Grubbs Test: 17.61228664665197
Grubbs Critical Value: 5.053985479091401
For variable quantity the computed value is GREATER than the critical value. We accept the hypothesis that there ARE outliers.

Grubbs Test: 21.556369925308974
Grubbs Critical Value: 5.053985479091401
For variable diff_delivery_and_estimate the computed value is GREATER than the critical value. We accept the hypothesis that there ARE outliers.

Grubbs Test: 20.82824266694584
Grubbs Critical Value: 5.053985479091401
For variable diff_delivery_and_purchase the computed value is GREATER than the critical value. We accept the hypothesis that there ARE outliers.
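To make the test concrete, the same statistic can be recomputed on a small synthetic sample with one obvious outlier (illustrative values only, mirroring the formula used above):

```python
import numpy as np
from scipy import stats

x = np.array([10.0, 10.5, 9.8, 10.2, 10.1, 25.0])  # 25.0 is an obvious outlier
n = len(x)

# Grubbs statistic: largest absolute deviation divided by the standard deviation
g = max(abs(x - x.mean())) / x.std()

# Critical value at alpha = 0.05
alpha = 0.05
t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
g_crit = ((n - 1) * t_value) / np.sqrt(n * (n - 2 + t_value ** 2))

print(g > g_crit)  # True: the outlier is detected
```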
In this step we will create RFM (Recency, Frequency, Monetary) variables for a better analysis of the clusters.
Rweight = 100.0
Fweight = 1.0
Mweight = 10.0
# Note: `quintiles` is a global computed further below, before these functions are called
def r_score(x):
    if x <= quintiles['Recency'][.2]:
        return 5
    elif x <= quintiles['Recency'][.4]:
        return 4
    elif x <= quintiles['Recency'][.6]:
        return 3
    elif x <= quintiles['Recency'][.8]:
        return 2
    else:
        return 1

def fm_score(x, c):
    if x <= quintiles[c][.2]:
        return 1
    elif x <= quintiles[c][.4]:
        return 2
    elif x <= quintiles[c][.6]:
        return 3
    elif x <= quintiles[c][.8]:
        return 4
    else:
        return 5
segt_map = {
r'[1-2][1-2]': 'Hibernando',
r'[1-2][3-4]': 'Em risco',
r'[1-2]5': 'Não pode perder',
r'3[1-2]': 'Quase dormindo',
r'33': 'Precisam de atenção',
r'[3-4][4-5]': 'Clientes Leais',
r'41': 'Promissores',
r'51': 'Novos Clientes',
r'[4-5][2-3]': 'Potenciais Clientes Leais',
r'5[4-5]': 'Campeões'
}
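The R and F scores are later concatenated into a two-character string and matched against these regex patterns. A quick illustration with a subset of the map (sketch only):

```python
import pandas as pd

segt_map = {
    r'[1-2][1-2]': 'Hibernando',
    r'[1-2]5': 'Não pode perder',
    r'5[4-5]': 'Campeões',
}

# Each string is "R score" + "F score", e.g. '25' means R=2, F=5
rf = pd.Series(['21', '25', '55'])
segments = rf.replace(segt_map, regex = True)
print(segments.tolist())  # ['Hibernando', 'Não pode perder', 'Campeões']
```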
# Compute Recency (days since the customer's most recent purchase)
dtRFM_R = dtGeneralOrders.groupby(by = 'customer_unique_id', as_index = False)['order_purchase_timestamp'].max()
dtRFM_R.rename(columns = {"order_purchase_timestamp": "LastPurchaseDate"}, inplace = True)
# Keep only the date part (yyyy-MM-dd)
dtRFM_R["LastPurchaseDate"] = dtRFM_R["LastPurchaseDate"].dt.date
# Get recent invoice date and use it to calculate recency
mostRecentDate = dtGeneralOrders['order_purchase_timestamp'].dt.date.max()
dtRFM_R['Recency'] = dtRFM_R['LastPurchaseDate'].apply(lambda x: (mostRecentDate - x).days)
# Compute Frequency (number of distinct orders per customer)
dtRFM_F = dtGeneralOrders.groupby(["customer_unique_id"]).agg({"order_id":"nunique"}).reset_index()
dtRFM_F.rename(columns = {"order_id":"Frequency"}, inplace = True)
# Compute Monetary value (total amount paid per customer)
dtRFM_M = dtGeneralOrders.groupby('customer_unique_id', as_index = False)['payment_value'].sum()
dtRFM_M.columns = ['customer_unique_id', 'Monetary']
# Merge the R, F and M dataframes
dtRFM = dtRFM_R.merge(dtRFM_F, on = 'customer_unique_id')
dtRFM = dtRFM.merge(dtRFM_M, on = 'customer_unique_id').drop(columns = 'LastPurchaseDate')
dtRFM.head()
| | customer_unique_id | Recency | Frequency | Monetary |
|---|---|---|---|---|
| 0 | 0000366f3b9a7992bf8c76cfdf3221e2 | 111 | 1 | 141.90 |
| 1 | 0000b849f77a49e4a4ce2b2a4ca5be3f | 114 | 1 | 27.19 |
| 2 | 0000f46a3911fa3c0805444483337064 | 537 | 1 | 86.22 |
| 3 | 0000f6ccb0745a6a4b88665a16c9f078 | 321 | 1 | 43.62 |
| 4 | 0004aac84e0df4da2b147fca70cf8255 | 288 | 1 | 196.89 |
dtRFM.isna().sum()
customer_unique_id    0
Recency               0
Frequency             0
Monetary              0
dtype: int64
quintiles = dtRFM[['Recency', 'Frequency', 'Monetary']].quantile([.2, .4, .6, .8]).to_dict()
dtRFM['R'] = dtRFM['Recency'].apply(lambda x: r_score(x))
dtRFM['F'] = dtRFM['Frequency'].apply(lambda x: fm_score(x, 'Frequency'))
dtRFM['M'] = dtRFM['Monetary'].apply(lambda x: fm_score(x, 'Monetary'))
dtRFM['Linear_Segment'] = dtRFM['R'].map(str) + dtRFM['F'].map(str) + dtRFM['M'].map(str)
dtRFM['RFM_Score'] = ((dtRFM['R'] * Rweight) + (dtRFM['F'] * Fweight) + (dtRFM['M'] * Mweight)) / (Rweight + Fweight + Mweight)
dtRFM['SegmentByR&F'] = dtRFM['R'].map(str) + dtRFM['F'].map(str)
dtRFM['SegmentByR&F'] = dtRFM['SegmentByR&F'].replace(segt_map, regex = True)
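As a quick arithmetic check of the weighted score above (Recency weighted highest), a customer with R=2, F=1, M=2 scores:

```python
# Weighted RFM score with the same weights as in the notebook
Rweight, Fweight, Mweight = 100.0, 1.0, 10.0
R, F, M = 2, 1, 2

rfm_score = (R * Rweight + F * Fweight + M * Mweight) / (Rweight + Fweight + Mweight)
print(round(rfm_score, 6))  # 1.990991
```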
dtRFMGrouped = dtRFM.groupby('SegmentByR&F').count()['customer_unique_id'].reset_index()
dtRFMGrouped
| | SegmentByR&F | customer_unique_id |
|---|---|---|
| 0 | Campeões | 627 |
| 1 | Clientes Leais | 1192 |
| 2 | Hibernando | 36091 |
| 3 | Novos Clientes | 18169 |
| 4 | Não pode perder | 982 |
| 5 | Promissores | 18170 |
| 6 | Quase dormindo | 18118 |
fig = px.treemap(dtRFMGrouped, path = ['SegmentByR&F'],
values = 'customer_unique_id')
fig.data[0].textinfo = 'label + text + value + percent parent'
fig.show()
dtRFM["SegmentByRFMScore"] = np.select(
    [dtRFM['RFM_Score'] > 4.5, dtRFM['RFM_Score'] > 4,
     dtRFM['RFM_Score'] > 3, dtRFM['RFM_Score'] > 1.6],
    ['Campeões', 'Alto Valor', 'Valor Médio', 'Baixo Valor'],
    default = 'Perdido')
dtRFMGrouped2 = dtRFM.groupby('SegmentByRFMScore').count()['customer_unique_id'].reset_index()
fig = px.treemap(dtRFMGrouped2, path = ['SegmentByRFMScore'],
values = 'customer_unique_id')
fig.data[0].textinfo = 'label + text + value + percent parent'
fig.show()
dtGeneralOrders = dtGeneralOrders.merge(dtRFM, on = 'customer_unique_id')
dtGeneralOrders.head()
| | customer_unique_id | order_id | order_purchase_timestamp | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | ... | Recency | Frequency | Monetary | R | F | M | Linear_Segment | RFM_Score | SegmentByR&F | SegmentByRFMScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 871766c5855e863f6eccc05f988b23cb | 00010242fe8c5a6d1ba2dd792cb16214 | 2017-09-13 08:59:02 | 13.29 | credit_card | 2 | 72.19 | 5 | 1 | 8001612.0 | ... | 350 | 1 | 72.19 | 2 | 1 | 2 | 212 | 1.990991 | Hibernando | Baixo Valor |
| 1 | eb28e67c4c0b83846050ddfb8a35d051 | 00018f77f2f0320c557190d7a144bdd3 | 2017-04-26 10:53:06 | 19.93 | credit_card | 3 | 259.83 | 4 | 1 | 2075536.0 | ... | 345 | 2 | 284.56 | 2 | 5 | 5 | 255 | 2.297297 | Não pode perder | Baixo Valor |
| 2 | eb28e67c4c0b83846050ddfb8a35d051 | c105be01f21c9df6a2579f59b95cbcd9 | 2017-09-18 16:12:04 | 11.85 | credit_card | 1 | 24.73 | 5 | 1 | 8020014.0 | ... | 345 | 2 | 284.56 | 2 | 5 | 5 | 255 | 2.297297 | Não pode perder | Baixo Valor |
| 3 | 3818d81c6709e39d06b2738a8d3a2474 | 000229ec398224ef6ca0657da4fc703e | 2018-01-14 14:33:31 | 17.87 | credit_card | 5 | 216.87 | 5 | 1 | 13104044.0 | ... | 227 | 1 | 216.87 | 3 | 1 | 4 | 314 | 3.072072 | Quase dormindo | Valor Médio |
| 4 | af861d436cfc08b2c2ddefd0ba074622 | 00024acbcdf0a6daa1e931b038114c75 | 2018-08-08 10:00:35 | 12.79 | credit_card | 2 | 25.78 | 4 | 1 | 5102721.0 | ... | 21 | 1 | 25.78 | 5 | 1 | 1 | 511 | 4.603604 | Novos Clientes | Campeões |
5 rows × 23 columns
dtGeneralOrdersProcessado = dtGeneralOrders.copy()
# Applying LabelEncoder to the categorical columns
labelencoderPaymentType = LabelEncoder()
labelencoderCustomerState = LabelEncoder()
labelencoderCustomerRegion = LabelEncoder()
dtGeneralOrdersProcessado['payment_type'] = labelencoderPaymentType.fit_transform(dtGeneralOrders['payment_type'])
dtGeneralOrdersProcessado['customer_state'] =\
labelencoderCustomerState.fit_transform(dtGeneralOrders['customer_state'])
dtGeneralOrdersProcessado['customer_region'] =\
labelencoderCustomerRegion.fit_transform(dtGeneralOrders['customer_region'])
dtGeneralOrdersProcessado.head()
| | customer_unique_id | order_id | order_purchase_timestamp | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | ... | Recency | Frequency | Monetary | R | F | M | Linear_Segment | RFM_Score | SegmentByR&F | SegmentByRFMScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 871766c5855e863f6eccc05f988b23cb | 00010242fe8c5a6d1ba2dd792cb16214 | 2017-09-13 08:59:02 | 13.29 | 1 | 2 | 72.19 | 5 | 1 | 8001612.0 | ... | 350 | 1 | 72.19 | 2 | 1 | 2 | 212 | 1.990991 | Hibernando | Baixo Valor |
| 1 | eb28e67c4c0b83846050ddfb8a35d051 | 00018f77f2f0320c557190d7a144bdd3 | 2017-04-26 10:53:06 | 19.93 | 1 | 3 | 259.83 | 4 | 1 | 2075536.0 | ... | 345 | 2 | 284.56 | 2 | 5 | 5 | 255 | 2.297297 | Não pode perder | Baixo Valor |
| 2 | eb28e67c4c0b83846050ddfb8a35d051 | c105be01f21c9df6a2579f59b95cbcd9 | 2017-09-18 16:12:04 | 11.85 | 1 | 1 | 24.73 | 5 | 1 | 8020014.0 | ... | 345 | 2 | 284.56 | 2 | 5 | 5 | 255 | 2.297297 | Não pode perder | Baixo Valor |
| 3 | 3818d81c6709e39d06b2738a8d3a2474 | 000229ec398224ef6ca0657da4fc703e | 2018-01-14 14:33:31 | 17.87 | 1 | 5 | 216.87 | 5 | 1 | 13104044.0 | ... | 227 | 1 | 216.87 | 3 | 1 | 4 | 314 | 3.072072 | Quase dormindo | Valor Médio |
| 4 | af861d436cfc08b2c2ddefd0ba074622 | 00024acbcdf0a6daa1e931b038114c75 | 2018-08-08 10:00:35 | 12.79 | 1 | 2 | 25.78 | 4 | 1 | 5102721.0 | ... | 21 | 1 | 25.78 | 5 | 1 | 1 | 511 | 4.603604 | Novos Clientes | Campeões |
5 rows × 23 columns
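As a quick sanity check on the encoding step above: `LabelEncoder` assigns each category the index of that category in the sorted list of unique values, and `inverse_transform` undoes the mapping. A minimal sketch with hypothetical payment types:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical payment types, just to illustrate the mapping
enc = LabelEncoder()
codes = enc.fit_transform(['credit_card', 'boleto', 'credit_card', 'voucher'])

# classes_ holds the sorted unique categories; a category's code is
# its position in this array
mapping = dict(zip(enc.classes_, enc.transform(enc.classes_)))

# inverse_transform recovers the original labels from the codes
recovered = list(enc.inverse_transform(codes))
```

Keeping a reference to each fitted encoder, as the notebook does, is what makes the later `inverse_transform` calls possible.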
colunas_quantitativas = ['freight_value', 'payment_value', 'quantity',\
'diff_delivery_and_estimate', 'diff_delivery_and_purchase', 'Recency',\
'Frequency', 'Monetary', 'RFM_Score']
def boxplot_individuais(data, columns, width = 15, height = 8):
    fig = plt.figure()
    fig.subplots_adjust(hspace = 0.4, wspace = 0.4)
    fig.set_figheight(height)
    fig.set_figwidth(width)
    columns_adjust = ceil(len(columns) / 3)
    for i, column in enumerate(columns):
        ax = fig.add_subplot(columns_adjust, 3, i + 1)
        sns.boxplot(x = data[column])
    plt.tight_layout()
    plt.show()
def hist_individual(data, columns, width = 15, height = 10):
    fig = plt.figure()
    fig.subplots_adjust(hspace = 0.4, wspace = 0.4)
    fig.set_figheight(height)
    fig.set_figwidth(width)
    columns_adjust = ceil(len(columns) / 3)
    for i, column in enumerate(columns):
        ax = fig.add_subplot(columns_adjust, 3, i + 1)
        data[column].hist(label = column)
        plt.title(column)
    plt.tight_layout()
    plt.show()
boxplot_individuais(dtGeneralOrdersProcessado, colunas_quantitativas)
hist_individual(dtGeneralOrdersProcessado, colunas_quantitativas)
scaler = StandardScaler()
#scaler = MinMaxScaler()
dtGeneralOrdersNormalizado = dtGeneralOrdersProcessado.copy()
dtGeneralOrdersNormalizado[colunas_quantitativas] = scaler.fit_transform(dtGeneralOrdersProcessado[colunas_quantitativas])
dtGeneralOrdersNormalizado.head()
| customer_unique_id | order_id | order_purchase_timestamp | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | ... | Recency | Frequency | Monetary | R | F | M | Linear_Segment | RFM_Score | SegmentByR&F | SegmentByRFMScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 871766c5855e863f6eccc05f988b23cb | 00010242fe8c5a6d1ba2dd792cb16214 | 2017-09-13 08:59:02 | -0.425764 | 1 | 2 | -0.375019 | 5 | -0.353985 | -0.568903 | ... | 0.733304 | -0.223342 | -0.184848 | 2 | 1 | 2 | 212 | -0.793074 | Hibernando | Baixo Valor |
| 1 | eb28e67c4c0b83846050ddfb8a35d051 | 00018f77f2f0320c557190d7a144bdd3 | 2017-04-26 10:53:06 | -0.003328 | 1 | 3 | 0.331341 | 4 | -0.353985 | -1.293293 | ... | 0.700521 | 2.322229 | -0.058797 | 2 | 5 | 5 | 255 | -0.554388 | Não pode perder | Baixo Valor |
| 2 | eb28e67c4c0b83846050ddfb8a35d051 | c105be01f21c9df6a2579f59b95cbcd9 | 2017-09-18 16:12:04 | -0.517377 | 1 | 1 | -0.553679 | 5 | -0.353985 | -0.566653 | ... | 0.700521 | 2.322229 | -0.058797 | 2 | 5 | 5 | 255 | -0.554388 | Não pode perder | Baixo Valor |
| 3 | 3818d81c6709e39d06b2738a8d3a2474 | 000229ec398224ef6ca0657da4fc703e | 2018-01-14 14:33:31 | -0.134385 | 1 | 5 | 0.169621 | 5 | -0.353985 | 0.054807 | ... | -0.073163 | -0.223342 | -0.098974 | 3 | 1 | 4 | 314 | 0.049347 | Quase dormindo | Valor Médio |
| 4 | af861d436cfc08b2c2ddefd0ba074622 | 00024acbcdf0a6daa1e931b038114c75 | 2018-08-08 10:00:35 | -0.457574 | 1 | 2 | -0.549727 | 4 | -0.353985 | -0.923257 | ... | -1.423831 | -0.223342 | -0.212395 | 5 | 1 | 1 | 511 | 1.242777 | Novos Clientes | Campeões |
5 rows × 23 columns
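The standardization above follows the usual z-score formula. A minimal sketch with made-up numbers, confirming that `StandardScaler` computes `z = (x - mean) / std` using the population (ddof = 0) standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data
x = np.array([[10.0], [20.0], [30.0], [40.0]])
z = StandardScaler().fit_transform(x)

# Same transformation by hand: z-score with population std (ddof = 0)
manual = (x - x.mean()) / x.std()
```

After scaling, each column has mean ~0 and standard deviation ~1, which puts the features on a comparable scale for the distance-based clustering that follows.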
boxplot_individuais(dtGeneralOrdersNormalizado, colunas_quantitativas)
hist_individual(dtGeneralOrdersNormalizado, colunas_quantitativas)
print(dtGeneralOrdersProcessado[colunas_quantitativas].skew(),\
'\nSoma:', sum(abs(dtGeneralOrdersProcessado[colunas_quantitativas].skew())))
freight_value                  5.569890
payment_value                 14.411085
quantity                       6.339456
diff_delivery_and_estimate     2.932376
diff_delivery_and_purchase     3.844929
Recency                        0.447031
Frequency                     11.008832
Monetary                      29.531411
RFM_Score                     -0.003545
dtype: float64
Soma: 74.08855421395997
print(dtGeneralOrdersNormalizado[colunas_quantitativas].skew(),\
'\nSoma:', sum(abs(dtGeneralOrdersNormalizado[colunas_quantitativas].skew())))
freight_value                  5.569890
payment_value                 14.411085
quantity                       6.339456
diff_delivery_and_estimate     2.932376
diff_delivery_and_purchase     3.844929
Recency                        0.447031
Frequency                     11.008832
Monetary                      29.531411
RFM_Score                     -0.003545
dtype: float64
Soma: 74.08855421395998
print(dtGeneralOrdersProcessado[colunas_quantitativas].kurtosis(),\
'\nSoma:', sum(abs(dtGeneralOrdersProcessado[colunas_quantitativas].kurtosis())))
freight_value                   58.620450
payment_value                  528.826822
quantity                        68.674498
diff_delivery_and_estimate      33.761390
diff_delivery_and_purchase      39.948358
Recency                         -0.656545
Frequency                      258.347088
Monetary                      1432.134347
RFM_Score                       -1.271986
dtype: float64
Soma: 2422.2414848619846
print(dtGeneralOrdersNormalizado[colunas_quantitativas].kurtosis(),\
'\nSoma:', sum(abs(dtGeneralOrdersNormalizado[colunas_quantitativas].kurtosis())))
freight_value                   58.620450
payment_value                  528.826822
quantity                        68.674498
diff_delivery_and_estimate      33.761390
diff_delivery_and_purchase      39.948358
Recency                         -0.656545
Frequency                      258.347088
Monetary                      1432.134347
RFM_Score                       -1.271986
dtype: float64
Soma: 2422.2414848619874
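The skewness and kurtosis being identical before and after scaling is expected: both statistics are invariant under positive linear transformations such as standardization. A small sketch with synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample
rng = np.random.default_rng(194)
s = pd.Series(rng.exponential(scale=2.0, size=1000))

# The same linear map StandardScaler applies
scaled = (s - s.mean()) / s.std()
```

So standardization changes the location and scale of each feature but not its shape; reducing the heavy skewness seen here would require a nonlinear transform (e.g. a log transform).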
def scatter_plot_conjunto(data, columns, target, rowN = 2, colN = 2):
    # Last column index before wrapping to the next row
    col_nums = colN - 1
    row_ = 0
    col_ = 0
    # Y-axis range
    y_range = [data[target].min(), data[target].max()]
    fig, axes = plt.subplots(nrows = rowN, ncols = colN, figsize = (12, 12))
    for column in columns:
        if target != column:
            # X-axis range
            x_range = [data[column].min(), data[column].max()]
            # Scatter plot of X vs Y
            scatter_plot = data.plot(kind = 'scatter', x = column, y = target,
                                     xlim = x_range, ylim = y_range, ax = axes[row_, col_])
            # Dashed lines marking the mean of X and of Y
            scatter_plot.plot(x_range, [data[target].mean(), data[target].mean()], '--', color = 'red', linewidth = 1)
            scatter_plot.plot([data[column].mean(), data[column].mean()], y_range, '--', color = 'red', linewidth = 1)
            col_ += 1
            if col_ > col_nums:
                col_ = 0
                row_ += 1
plt.figure(figsize = (12, 12))
sns.heatmap(dtGeneralOrdersProcessado[colunas_quantitativas].corr(), annot = True)
plt.show()
scatter_plot_conjunto(dtGeneralOrdersProcessado, colunas_quantitativas, 'freight_value', rowN = 4)
dtGeneralOrdersFinal = dtGeneralOrdersNormalizado.drop(
['order_id', 'customer_unique_id', 'order_purchase_timestamp', 'Recency', 'Frequency', 'Monetary', 'R', 'F', 'M',\
'Linear_Segment', 'SegmentByR&F', 'SegmentByRFMScore', 'diff_delivery_and_estimate', 'diff_delivery_and_purchase'], axis = 1)
dtGeneralOrdersFinal.head()
| freight_value | payment_type | payment_installments | payment_value | review_score | quantity | customer_state | customer_region | RFM_Score | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.425764 | 1 | 2 | -0.375019 | 5 | -0.353985 | 18 | 3 | -0.793074 |
| 1 | -0.003328 | 1 | 3 | 0.331341 | 4 | -0.353985 | 25 | 3 | -0.554388 |
| 2 | -0.517377 | 1 | 1 | -0.553679 | 5 | -0.353985 | 25 | 3 | -0.554388 |
| 3 | -0.134385 | 1 | 5 | 0.169621 | 5 | -0.353985 | 10 | 3 | 0.049347 |
| 4 | -0.457574 | 1 | 2 | -0.549727 | 4 | -0.353985 | 25 | 3 | 1.242777 |
dtGeneralOrdersFinal[dtGeneralOrdersFinal.columns.difference(colunas_quantitativas)].head()
| customer_region | customer_state | payment_installments | payment_type | review_score | |
|---|---|---|---|---|---|
| 0 | 3 | 18 | 2 | 1 | 5 |
| 1 | 3 | 25 | 3 | 1 | 4 |
| 2 | 3 | 25 | 1 | 1 | 5 |
| 3 | 3 | 10 | 5 | 1 | 5 |
| 4 | 3 | 25 | 2 | 1 | 4 |
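Note that `Index.difference` returns the remaining labels in sorted order, which is why the categorical columns appear alphabetically in the table above. A minimal illustration:

```python
import pandas as pd

# Columns deliberately out of order
df = pd.DataFrame(columns=['b', 'a', 'c'])

# difference() drops the listed labels and returns the rest, sorted
rest = df.columns.difference(['c'])
```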
# Range of K values to test
k_range = range(1, 12)
# Fit a K-Means model for each value of K
k_means_var = [KMeans(n_clusters = k).fit(dtGeneralOrdersFinal) for k in k_range]
# Collect the inertia (within-cluster sum of squares) of each model
inertia = [X.inertia_ for X in k_means_var]
Analyzing the Elbow metric, the ideal number of clusters appears to be K = 6: the drop in inertia starts to slow down sharply at K = 4, and there is another abrupt reduction at K = 6.
# Inertia values (inertia and WCSS are the same quantity)
inertia = []
# Loop over the candidate values of K
for n in range(2, 12):
    modelo = KMeans(n_clusters = n,
                    init = 'k-means++',
                    n_init = 10,
                    max_iter = 300,
                    tol = 0.0001,
                    random_state = seed_,
                    algorithm = 'elkan')
    modelo.fit(dtGeneralOrdersFinal)
    inertia.append(modelo.inertia_)
# Plot
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(2 , 12) , inertia , 'o')
plt.plot(np.arange(2 , 12) , inertia , '-' , alpha = 0.5)
plt.xlabel('Número de Clusters') , plt.ylabel('Inertia')
plt.show()
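Reading the elbow off the plot is somewhat subjective. One simple heuristic, not used in the notebook but shown here as a sketch with hypothetical inertia values, is to pick the K where the curve bends the most, i.e. the largest second difference of the inertia sequence:

```python
import numpy as np

# Hypothetical inertia values for K = 2..9 (in the notebook these
# come from the loop above)
ks = np.arange(2, 10)
inertia = np.array([1000.0, 600.0, 380.0, 300.0, 260.0, 240.0, 228.0, 220.0])

# Drop in inertia gained by each additional cluster
drops = -np.diff(inertia)

# The elbow is where the curve bends the most: the largest
# second-order difference of the inertia values
elbow = ks[np.argmax(np.diff(inertia, 2)) + 1]
```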
dtGeneralOrdersTemp = dtGeneralOrdersFinal[:57028].copy()
# Range of K values to test
k_range = range(1, 10)
# Fit a K-Means model for each value of K (this cell can take a long time to run)
k_means_var = [KMeans(n_clusters = k).fit(dtGeneralOrdersTemp) for k in k_range]
# Cluster centroids of each model
centroids = [X.cluster_centers_ for X in k_means_var]
# Euclidean distance from each data point to each centroid
k_euclid = [cdist(dtGeneralOrdersTemp, cent, 'euclidean') for cent in centroids]
dist = [np.min(ke, axis = 1) for ke in k_euclid]
# Within-cluster sum of squared distances
soma_quadrados_intra_cluster = [sum(d**2) for d in dist]
# Total sum of squares
soma_total = sum(pdist(dtGeneralOrdersTemp)**2) / dtGeneralOrdersTemp.shape[0]
# Between-cluster sum of squares
soma_quadrados_inter_cluster = soma_total - soma_quadrados_intra_cluster
Checking the variance explained by each value of K again, we find that the ideal value is 6.
# Elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(k_range, soma_quadrados_inter_cluster/soma_total * 100, 'b*-')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Número de Clusters')
plt.ylabel('Percentual de Variância Explicada')
plt.title('Variância Explicada x Valor de K')
plt.savefig('export/variancia_explicada.png')
plt.show()
def testa_k(pca_data, k = 2):
    # Fit a model
    modelo = KMeans(n_clusters = k)
    modelo.fit(pca_data)
    # Grid bounds with a small margin around the data
    x_min, x_max = pca_data[:, 0].min() - 1, pca_data[:, 0].max() + 1
    y_min, y_max = pca_data[:, 1].min() - 1, pca_data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
    Z = modelo.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the cluster regions
    plt.figure(figsize = (8, 8))
    plt.clf()
    plt.imshow(Z,
               interpolation = 'nearest',
               extent = (xx.min(), xx.max(), yy.min(), yy.max()),
               cmap = plt.cm.Paired,
               aspect = 'auto',
               origin = 'lower')
    plt.savefig('export/meshgrid_' + str(k) + 'k.png')
    plt.show()
    # Plot the centroids
    plt.figure(figsize = (13, 13))
    plt.plot(pca_data[:, 0], pca_data[:, 1], 'k.', markersize = 4)
    centroids = modelo.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1], marker = 'x', s = 600, linewidths = 100, color = 'r', zorder = 8)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.savefig('export/centroides_' + str(k) + 'k.png')
    plt.show()
    # Silhouette Score
    labels = modelo.labels_
    print(f"Silhouette Score: {silhouette_score(pca_data, labels, metric = 'euclidean', random_state = seed_)}")
# Apply dimensionality reduction (PCA down to 2 components)
pca = PCA(n_components = 2).fit_transform(dtGeneralOrdersFinal)
testa_k(pca, 4)
Silhouette Score: 0.6093958650516936
testa_k(pca, 5)
Silhouette Score: 0.6026403332796946
testa_k(pca, 6)
Silhouette Score: 0.6113627678449342
testa_k(pca, 8)
Silhouette Score: 0.5798900430267735
testa_k(pca, 10)
Silhouette Score: 0.6118385309447221
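The silhouette score compares each point's mean intra-cluster distance with its distance to the nearest other cluster, ranging from -1 to 1; higher means better-separated clusters. On well-separated synthetic blobs it approaches 1, as this sketch with made-up data illustrates:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs
rng = np.random.default_rng(194)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=194).fit_predict(X)
score = silhouette_score(X, labels)
```

Against this benchmark, the ~0.6 scores obtained on the 2-component PCA projection indicate reasonably separated but overlapping clusters.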
def treina_modelo(k, data, silhouette = True):
    # Fit a model
    modelo = KMeans(n_clusters = k)
    modelo.fit(data)
    if silhouette:
        # Silhouette Score
        labels = modelo.labels_
        print(f"Silhouette Score: {silhouette_score(data, labels, metric = 'euclidean', random_state = seed_)}")
# Model with k = 4
treina_modelo(4, dtGeneralOrdersFinal)
Silhouette Score: 0.4455774376463494
# Model with k = 5
treina_modelo(5, dtGeneralOrdersFinal)
Silhouette Score: 0.4307782879947419
# Model with k = 6
treina_modelo(6, dtGeneralOrdersFinal)
Silhouette Score: 0.4051423836510225
# Model with k = 8
treina_modelo(8, dtGeneralOrdersFinal)
Silhouette Score: 0.3257378455573458
# Fitting the final model with k = 4
modelo = KMeans(n_clusters = 4, random_state = seed_)
modelo.fit(dtGeneralOrdersFinal)
KMeans(n_clusters=4, random_state=194)
dtGeneralOrdersClusterizado = dtGeneralOrdersProcessado.copy()
dtGeneralOrdersClusterizado = dtGeneralOrdersClusterizado.drop('order_id', axis = 1)
dtGeneralOrdersClusterizado['cluster'] = modelo.labels_
dtGeneralOrdersClusterizado['payment_type'] =\
labelencoderPaymentType.inverse_transform(dtGeneralOrdersClusterizado['payment_type'])
dtGeneralOrdersClusterizado['customer_state'] =\
labelencoderCustomerState.inverse_transform(dtGeneralOrdersClusterizado['customer_state'])
dtGeneralOrdersClusterizado['customer_region'] =\
labelencoderCustomerRegion.inverse_transform(dtGeneralOrdersClusterizado['customer_region'])
metodos_pagamento = {
'credit_card': 'Crédito',
'debit_card': 'Débito'
}
dtGeneralOrdersClusterizado = dtGeneralOrdersClusterizado.replace({'payment_type': metodos_pagamento})
dtGeneralOrdersClusterizado.head()
| customer_unique_id | order_purchase_timestamp | freight_value | payment_type | payment_installments | payment_value | review_score | quantity | diff_delivery_and_estimate | diff_delivery_and_purchase | ... | Frequency | Monetary | R | F | M | Linear_Segment | RFM_Score | SegmentByR&F | SegmentByRFMScore | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 871766c5855e863f6eccc05f988b23cb | 2017-09-13 08:59:02 | 13.29 | Crédito | 2 | 72.19 | 5 | 1 | 8001612.0 | 8091514.0 | ... | 1 | 72.19 | 2 | 1 | 2 | 212 | 1.990991 | Hibernando | Baixo Valor | 0 |
| 1 | eb28e67c4c0b83846050ddfb8a35d051 | 2017-04-26 10:53:06 | 19.93 | Crédito | 3 | 259.83 | 4 | 1 | 2075536.0 | 17184842.0 | ... | 2 | 284.56 | 2 | 5 | 5 | 255 | 2.297297 | Não pode perder | Baixo Valor | 1 |
| 2 | eb28e67c4c0b83846050ddfb8a35d051 | 2017-09-18 16:12:04 | 11.85 | Crédito | 1 | 24.73 | 5 | 1 | 8020014.0 | 8181218.0 | ... | 2 | 284.56 | 2 | 5 | 5 | 255 | 2.297297 | Não pode perder | Baixo Valor | 1 |
| 3 | 3818d81c6709e39d06b2738a8d3a2474 | 2018-01-14 14:33:31 | 17.87 | Crédito | 5 | 216.87 | 5 | 1 | 13104044.0 | 8011415.0 | ... | 1 | 216.87 | 3 | 1 | 4 | 314 | 3.072072 | Quase dormindo | Valor Médio | 2 |
| 4 | af861d436cfc08b2c2ddefd0ba074622 | 2018-08-08 10:00:35 | 12.79 | Crédito | 2 | 25.78 | 4 | 1 | 5102721.0 | 7202756.0 | ... | 1 | 25.78 | 5 | 1 | 1 | 511 | 4.603604 | Novos Clientes | Campeões | 1 |
5 rows × 23 columns
dtClusters = pd.DataFrame()
dtClusters['Preço'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['payment_value'].mean(), 2)
dtClusters['Frete'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['freight_value'].mean(), 2)
dtClusters['Parcelas'] =\
round(dtGeneralOrdersClusterizado.groupby('cluster')['payment_installments'].mean(), 2)
dtClusters['Score'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['review_score'].mean(), 2)
dtClusters['Itens'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['quantity'].mean(), 2)
dtClusters['RFM'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['RFM_Score'].mean(), 2)
dtClusters['Pagamento'] =\
dtGeneralOrdersClusterizado.groupby('cluster')['payment_type'].agg(lambda x: x.value_counts().index[0])
dtClusters['Estado'] =\
dtGeneralOrdersClusterizado.groupby('cluster')['customer_state'].agg(lambda x: x.value_counts().index[0])
dtClusters['Região'] =\
dtGeneralOrdersClusterizado.groupby('cluster')['customer_region'].agg(lambda x: x.value_counts().index[0])
dtClusters
| Preço | Frete | Parcelas | Score | Itens | RFM | Pagamento | Estado | Região | |
|---|---|---|---|---|---|---|---|---|---|
| cluster | |||||||||
| 0 | 165.98 | 23.13 | 2.40 | 3.96 | 1.33 | 2.95 | Crédito | RJ | Sudeste |
| 1 | 130.20 | 15.62 | 1.77 | 4.16 | 1.37 | 3.05 | Crédito | SP | Sudeste |
| 2 | 187.74 | 23.64 | 3.16 | 4.01 | 1.38 | 2.97 | Crédito | MG | Sudeste |
| 3 | 320.17 | 23.27 | 8.41 | 3.98 | 1.67 | 3.03 | Crédito | SP | Sudeste |
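The `value_counts().index[0]` aggregation used to build the table simply picks each group's most frequent category, i.e. its mode. A toy check with hypothetical payment types:

```python
import pandas as pd

# Hypothetical payment types within one cluster
s = pd.Series(['credit_card', 'boleto', 'credit_card', 'voucher'])

# value_counts() sorts by count descending, so index[0] is the mode
top = s.value_counts().index[0]
```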
dtClusters.to_csv('analise_clusters_geral.csv')
dtClustersIndexado = dtClusters.reset_index()
dtClustersIndexado = dtClustersIndexado.rename(columns = {'cluster': 'Cluster'})
dtClustersIndexado['Cluster'] = dtClustersIndexado['Cluster'].apply(lambda x: 'Cluster ' + str(x))
dtClustersIndexado
| Cluster | Preço | Frete | Parcelas | Score | Itens | RFM | Pagamento | Estado | Região | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cluster 0 | 165.98 | 23.13 | 2.40 | 3.96 | 1.33 | 2.95 | Crédito | RJ | Sudeste |
| 1 | Cluster 1 | 130.20 | 15.62 | 1.77 | 4.16 | 1.37 | 3.05 | Crédito | SP | Sudeste |
| 2 | Cluster 2 | 187.74 | 23.64 | 3.16 | 4.01 | 1.38 | 2.97 | Crédito | MG | Sudeste |
| 3 | Cluster 3 | 320.17 | 23.27 | 8.41 | 3.98 | 1.67 | 3.03 | Crédito | SP | Sudeste |
headerColor = 'grey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'
fig = go.Figure(data=[go.Table(
header=dict(
values=list(dtClustersIndexado.columns),
line_color='darkslategray',
fill_color=headerColor,
align=['left','center'],
font=dict(color='white', size=12)
),
cells=dict(
values=[dtClustersIndexado.Cluster, dtClustersIndexado['Preço'], dtClustersIndexado.Frete, dtClustersIndexado.Parcelas,\
dtClustersIndexado.Score, dtClustersIndexado.Itens, dtClustersIndexado.RFM,\
dtClustersIndexado.Pagamento, dtClustersIndexado.Estado,\
dtClustersIndexado['Região']],
line_color='darkslategray',
fill_color = [[rowOddColor,rowEvenColor,rowOddColor, rowEvenColor]*10],
align = ['left', 'center'],
font = dict(color = 'darkslategray', size = 11)
))
])
fig.show()
dtGeneralOrdersCluster2 = dtGeneralOrdersNormalizado[['Recency', 'Frequency', 'Monetary', 'RFM_Score']]
dtGeneralOrdersCluster2.head()
| Recency | Frequency | Monetary | RFM_Score | |
|---|---|---|---|---|
| 0 | 0.733304 | -0.223342 | -0.184848 | -0.793074 |
| 1 | 0.700521 | 2.322229 | -0.058797 | -0.554388 |
| 2 | 0.700521 | 2.322229 | -0.058797 | -0.554388 |
| 3 | -0.073163 | -0.223342 | -0.098974 | 0.049347 |
| 4 | -1.423831 | -0.223342 | -0.212395 | 1.242777 |
# Fitting a model with k = 4 on the RFM features
modelo = KMeans(n_clusters = 4, random_state = seed_)
modelo.fit(dtGeneralOrdersCluster2)
KMeans(n_clusters=4, random_state=194)
dtGeneralOrdersClusterizado = dtGeneralOrdersProcessado.copy()
dtGeneralOrdersClusterizado['cluster'] = modelo.labels_
dtClusters = pd.DataFrame()
dtClusters['Recency'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['Recency'].mean(), 2)
dtClusters['Frequency'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['Frequency'].mean(), 2)
dtClusters['Monetary'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['Monetary'].mean(), 2)
dtClusters['RFM'] = round(dtGeneralOrdersClusterizado.groupby('cluster')['RFM_Score'].mean(), 2)
dtClusters
| Recency | Frequency | Monetary | RFM | |
|---|---|---|---|---|
| cluster | ||||
| 0 | 395.67 | 1.00 | 296.31 | 1.64 |
| 1 | 133.88 | 1.00 | 296.32 | 3.90 |
| 2 | 221.00 | 2.23 | 682.74 | 3.27 |
| 3 | 221.85 | 1.49 | 26987.42 | 3.20 |
dtClusters.to_csv('analise_clusters_RFM.csv')
dtClustersIndexado = dtClusters.reset_index()
dtClustersIndexado = dtClustersIndexado.rename(columns = {'cluster': 'Cluster'})
dtClustersIndexado['Cluster'] = dtClustersIndexado['Cluster'].apply(lambda x: 'Cluster ' + str(x))
dtClustersIndexado
| Cluster | Recency | Frequency | Monetary | RFM | |
|---|---|---|---|---|---|
| 0 | Cluster 0 | 395.67 | 1.00 | 296.31 | 1.64 |
| 1 | Cluster 1 | 133.88 | 1.00 | 296.32 | 3.90 |
| 2 | Cluster 2 | 221.00 | 2.23 | 682.74 | 3.27 |
| 3 | Cluster 3 | 221.85 | 1.49 | 26987.42 | 3.20 |
headerColor = 'grey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'
fig = go.Figure(data=[go.Table(
header=dict(
values=list(dtClustersIndexado.columns),
line_color='darkslategray',
fill_color=headerColor,
align=['left','center'],
font=dict(color='white', size=12)
),
cells=dict(
values=[dtClustersIndexado.Cluster, dtClustersIndexado.Recency, dtClustersIndexado.Frequency, dtClustersIndexado.Monetary\
, dtClustersIndexado.RFM],
line_color='darkslategray',
fill_color = [[rowOddColor,rowEvenColor,rowOddColor, rowEvenColor]*10],
align = ['left', 'center'],
font = dict(color = 'darkslategray', size = 11)
))
])
fig.show()